4. Main Analysis
4.1 Which accommodation type is most popular?
Rating scores are a natural way to gauge popularity, so we first examine the quality of the review scores in this data set. We assume that the review score matters because it signals a good host, which Airbnb customers value.
ggplot(airbnb, aes(x = review_scores_rating)) +
geom_histogram(binwidth = 1, color = 'black', fill = "lightblue") +
ggtitle("Histogram of Review Scores Rating") +
xlab("Review Scores Rating") +
ylab("Count") +
theme(plot.title = element_text(hjust = 0.5))
There are 1148 observations with missing values. Most listings score above 80, and the mode is 100. The data also show rounding patterns: at 60 or below, only the values 20, 40, 50 and 60 appear.
When people rate a stay, they tend to give the same score on every aspect, which produces the rounding patterns. People are also more likely to rate their stay if they are satisfied, which explains the many full scores.
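The figures quoted above (missing count, mode, and the distinct low values) can be checked with a few lines of base R. This is a minimal sketch on a hypothetical toy vector `scores`; with the real data one would substitute `airbnb$review_scores_rating`.

```r
# Toy stand-in for airbnb$review_scores_rating (hypothetical values).
scores <- c(100, 100, 98, 95, 100, 80, 60, 40, 20, NA, NA, 93)

# Number of listings without a rating.
n_missing <- sum(is.na(scores))

# Mode: the most frequent non-missing score (table() drops NA by default).
score_counts <- table(scores)
score_mode <- as.numeric(names(score_counts)[which.max(score_counts)])

# Distinct values at or below 60, to inspect the rounding pattern.
low_values <- sort(unique(scores[!is.na(scores) & scores <= 60]))

print(n_missing)
print(score_mode)
print(low_values)
```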
ggplot(airbnb, aes(x = number_of_reviews, y = review_scores_rating)) +
geom_point(stroke = 0, alpha = 0.3, color = 'blue') +
ggtitle("Number of Reviews v.s. Review Scores Rating") +
xlab("Number of Reviews") +
ylab("Review Scores Rating") +
theme(plot.title = element_text(hjust = 0.5))
The scatterplot shows few listings with many reviews (over 100) and a low review score (below 80). When choosing where to stay, people prefer listings with higher rating scores, so high-scoring listings generally collect more reviews. Once the number of reviews passes a certain threshold (approximately 300), the scores tend to be higher, which points to consistently good hosts.
Since the quality of the reviews in this data set meets our expectations, we can use the scores to analyze how people feel about different types of accommodation.
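The visual impression from the scatterplot can be backed by simple proportions. The sketch below uses a hypothetical toy data frame `toy` with the same column names as `airbnb`; the thresholds 100, 80 and 300 are the ones mentioned in the text.

```r
# Hypothetical toy data; column names follow the airbnb data frame.
toy <- data.frame(
  number_of_reviews    = c(5, 12, 150, 320, 210, 3, 400, 110),
  review_scores_rating = c(70, 100, 95, 97, 78, 60, 98, 92)
)

# Listings with more than 100 reviews.
popular <- toy[toy$number_of_reviews > 100, ]

# Share of those listings rated below 80.
share_low <- mean(popular$review_scores_rating < 80, na.rm = TRUE)

# Lowest score among listings with more than 300 reviews.
min_over_300 <- min(toy$review_scores_rating[toy$number_of_reviews > 300],
                    na.rm = TRUE)

print(share_low)
print(min_over_300)
```

A small `share_low` and a high `min_over_300` on the real data would be consistent with the reading of the plot.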
ggplot(airbnb, aes(x=review_scores_rating,group = room_type, color = room_type)) +
geom_density(alpha = .3) +
ggtitle("Density Curve of Review Scores Rating Group by Room Type") +
xlab("Review Scores Rating") +
ylab("Density") +
scale_color_discrete(name = "Room Type") +
theme(plot.title = element_text(hjust = 0.5))
The graph above shows the review score distribution for three types of accommodation: entire home, private room and shared room. We want to find out which type travelers favor. The density becomes more sharply peaked as guests get more private space, which suggests that people are more satisfied when they interact less with others. Compared with sharing space with strangers, whether another traveler (shared room) or the host's household (private room), having control of the entire home allows quiet family time or time alone after a day of travel. Giving customers more personal space appears to be a strong positive factor in their satisfaction.
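How sharply each density is peaked can be summarized by the median and interquartile range per room type. The following is a minimal sketch on hypothetical toy data; the room type labels and scores are illustrative and not taken from the data set.

```r
# Hypothetical toy data with three room types.
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room", "Shared room"), each = 3),
  review_scores_rating = c(96, 98, 100, 90, 95, 100, 80, 90, 100)
)

# Median score per room type.
med <- tapply(toy$review_scores_rating, toy$room_type, median, na.rm = TRUE)

# Interquartile range per room type: a smaller IQR means a sharper peak.
iqr <- tapply(toy$review_scores_rating, toy$room_type, IQR, na.rm = TRUE)

print(med)
print(iqr)
```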
4.2 What's the most effective driver of the listing price?
4.2.1 Review score: does a nice host have the power to charge more?
ggplot(airbnb, aes(x = price, y = review_scores_rating)) +
geom_point(stroke = 0, alpha = 0.3, color = 'blue') + xlim(0,2500) +
ggtitle("Price v.s. Review Scores Rating") +
xlab("Price") +
ylab("Review Scores Rating") +
theme(plot.title = element_text(hjust = 0.5))
Very few points lie in the lower right part of the graph, which indicates that people who stay in higher-priced rooms are much less likely to be unsatisfied, probably because nicer rooms or better service make a good experience more likely. However, we cannot conclude that high review scores drive prices up, since lower-priced rooms also receive high review scores.
Next, we assume that customers prefer a larger space so they can stay more comfortably.
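One way to put a number on the relationship between price and review score is a rank correlation, which is less sensitive to the few very expensive listings than a Pearson correlation. This is a sketch on hypothetical toy values, not a result from the data set.

```r
# Hypothetical toy data; column names follow the airbnb data frame.
toy <- data.frame(
  price                = c(50, 80, 120, 200, 350, 500, 90, 60),
  review_scores_rating = c(80, 95, 100, 96, 98, 99, 70, 100)
)

# Spearman rank correlation, ignoring rows with missing values.
rho <- cor(toy$price, toy$review_scores_rating,
           method = "spearman", use = "complete.obs")

print(rho)
```

A value near zero would support the conclusion that review score alone does not drive price.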
ggplot(SFPrice, aes(x = square_feet,y = price)) +
geom_point(alpha = 0.3, color = "blue",stroke = 0) +
geom_density_2d(color = "maroon") +
ylim(0,700) +
xlab("Square Feet") +
ylab("Price") +
ggtitle("Price v.s Square Feet") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
The figure above depicts price against square feet for San Francisco listings. It suggests a positive correlation between price and room size, but the relationship is weak, so we cannot claim that room size is a significant driver of price.
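The strength of the relationship can also be tested formally with `cor.test()` from base R. The sketch below runs on hypothetical toy values; on the real data one would pass `SFPrice$square_feet` and `SFPrice$price`.

```r
# Hypothetical toy data; column names follow the SFPrice data frame.
toy <- data.frame(
  square_feet = c(300, 450, 600, 800, 1000, 1200, 1500, 2000),
  price       = c(120, 90, 150, 200, 140, 260, 180, 300)
)

# Pearson correlation with a test of the null hypothesis of zero correlation.
# Rows with missing values are dropped pairwise by cor.test().
ct <- cor.test(toy$square_feet, toy$price)

print(ct$estimate)  # correlation coefficient
print(ct$p.value)   # p-value of the test
```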
4.2.2 Description
Most hosts write a description for their listings, and we assume that people value what the description says. We plot a word cloud to see which words hosts use most.
set.seed(123)
wordcloud(words = word_counts_description$words, freq = word_counts_description$freq, min.freq = 1,
max.words=200, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(8, "Dark2"))
We pick “private” and “kitchen” for further analysis.
We then control for each keyword to see whether it has an effect on price. We processed each description to mine the keywords and divided the data into two groups, listings that mention the keyword and listings that do not, to check the effect.
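The keyword processing itself is not shown in this report. One plausible way to build such a flag is a case-insensitive pattern match with `grepl()`. The helper `has_keyword` and the toy data below are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data standing in for the listing descriptions.
toy <- data.frame(
  description = c("Private room with shared kitchen",
                  "Cozy studio with kitchen",
                  "Full KITCHEN and private bath",
                  NA),
  price = c(90, 120, 200, 75),
  stringsAsFactors = FALSE
)

# TRUE when the keyword appears anywhere in the text, ignoring case.
# grepl() returns FALSE for missing descriptions.
has_keyword <- function(text, keyword) {
  grepl(keyword, text, ignore.case = TRUE)
}

toy$private <- has_keyword(toy$description, "private")
toy$kitchen <- has_keyword(toy$description, "kitchen")

print(toy[, c("private", "kitchen", "price")])
```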
SFPrice_Description$private <- paste("private", SFPrice_Description$private)
g1 = SFPrice_Description%>%
mutate(description = fct_reorder(as.factor(private), desc(price), .fun = median)) %>%
ggplot(aes(x = price, y = description, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
xlim(0,1000) +
xlab("Price") +
ylab("Description") +
ggtitle("Distribution of Each Key Word Respectively in Description") +
theme(plot.title = element_text(hjust = 0.5))
SFPrice_Description$kitchen <- paste("kitchen", SFPrice_Description$kitchen)
g2 = SFPrice_Description%>%
mutate(description = fct_reorder(as.factor(kitchen), desc(price), .fun = median)) %>%
ggplot(aes(x = price, y = description, fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu") +
xlim(0,1000) +
xlab("Price") +
ylab("Description")
grid.arrange(g1,g2,nrow = 2)
The ridgeline plots show the price distribution conditional on whether “private” or “kitchen” is mentioned in the room description. We cannot see a definite difference between the two price distributions in either case.
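# Names of the two keyword columns (selected by position, as in the original)
check_d = colnames(SFPrice_Description)[99:100]
# Stack the keyword columns into long format; all_of() selects by the names stored in check_d
Data_Des <- SFPrice_Description %>% gather(key = description, value = value, all_of(check_d))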
Data_Des$description <- Data_Des$value
ggplot(Data_Des,aes(x = description, y = price, fill = description)) +
geom_boxplot() +
ggtitle("Price v.s Description") +
scale_x_discrete(name = "Description") +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,500) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The boxplot checks whether the median price differs depending on whether “kitchen” or “private” is included in the description. It picks out a subtle difference between the median price of rooms that include “private” in the description and those that do not.
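A subtle difference in medians like this one can be checked with a Wilcoxon rank-sum test, which compares two groups without assuming normally distributed prices. The sketch uses hypothetical toy data with a logical `private` flag; the report's own column may be coded differently.

```r
# Hypothetical toy data: price by whether "private" appears in the description.
toy <- data.frame(
  private = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  price   = c(150, 180, 210, 165, 240, 120, 140, 135, 200, 110)
)

# Median price in each group.
group_medians <- tapply(toy$price, toy$private, median)

# Wilcoxon rank-sum test; exact = FALSE uses the normal approximation,
# which avoids problems when prices are tied.
test <- wilcox.test(price ~ private, data = toy, exact = FALSE)

print(group_medians)
print(test$p.value)
```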
4.2.3 Cleaning Fee
We assume that Airbnb customers are price sensitive, and that this sensitivity extends to the cleaning fee. We want to know whether there is a relationship between the cleaning fee and the price.
SFPrice = fread("listings.csv", header = TRUE, sep = ',')
SFPrice$cleaning_fee = as.numeric(gsub('[$,]', '', SFPrice$cleaning_fee))
SFPrice$price = as.numeric(gsub('[$,]', '', SFPrice$price))
ggplot(SFPrice, aes(x = cleaning_fee,y = price)) +
geom_point(alpha = 0.3, color = "blue",stroke = 0) +
geom_density_2d(color = "maroon") +
ylim(0,700) +
xlim(0,300) +
xlab("Cleaning Fee") +
ylab("Price") +
ggtitle("Price v.s Cleaning Fee") +
theme_classic() +
theme(plot.title = element_text(hjust = 0.5))
The figure above depicts price against cleaning fee for San Francisco listings. The result is very similar to that for room size: a weak positive correlation.
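The `gsub()` calls in the code above suggest that price and cleaning fee are stored as text with a dollar sign and thousands separators. The sketch below shows what that conversion does on a few hypothetical raw values.

```r
# Hypothetical raw values in the style the gsub() pattern implies.
raw <- c("$85.00", "$1,200.00", "", NA)

# Remove "$" and "," and convert the remainder to numeric.
# Empty strings and missing values both become NA.
parsed <- as.numeric(gsub("[$,]", "", raw))

print(parsed)
```

Listings with an empty fee therefore end up as `NA` and are dropped from the scatterplot rather than being treated as a fee of zero.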
4.2.4 Location
We break the listings down by zip code to see whether location drives the price.
SFPrice <- SFPrice[!is.na(SFPrice$zipcode)]
SFPrice%>%
mutate(zipcode = fct_reorder(as.factor(zipcode), desc(price), .fun = median)) %>%
ggplot(aes(x =price, y = as.factor(zipcode),fill = ..x..)) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
scale_fill_distiller(name = "Price", palette = "GnBu")+
xlim(0,1000) +
ylab("Zipcode") +
xlab("Price") +
ggtitle("Zipcode Distribution Comparison") +
theme(plot.title = element_text(hjust = 0.5))
Price distribution grouped by zip code. The figure suggests that zip code plays an essential role in determining the price of room rentals.
SFPrice <- SFPrice[!is.na(SFPrice$zipcode)]
SFPrice%>%
mutate(zipcode = fct_reorder(as.factor(zipcode), desc(price), .fun = median)) %>%
ggplot(aes(x = zipcode, y = price, fill = zipcode)) +
geom_boxplot() +
ggtitle("Zipcode v.s Price") +
scale_x_discrete(name = "Zip Code") +
theme(plot.title = element_text(hjust = 0.5)) +
ylim(0,850) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
A boxplot further supports our guess. The price distribution varies more across zip codes than across room sizes or description keywords.
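With more than two groups, a Kruskal-Wallis test is one way to check whether price distributions differ across zip codes. The sketch below uses hypothetical toy data; the zip codes and prices are illustrative only.

```r
# Hypothetical toy data: four listings in each of three zip codes.
toy <- data.frame(
  zipcode = rep(c("94102", "94110", "94123"), each = 4),
  price   = c(100, 120, 110, 130, 150, 170, 160, 180, 250, 300, 280, 260)
)

# Median price per zip code.
zip_medians <- tapply(toy$price, toy$zipcode, median)

# Kruskal-Wallis rank-sum test of equal price distributions across zip codes.
kw <- kruskal.test(price ~ zipcode, data = toy)

print(zip_medians)
print(kw$p.value)
```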
medians <- airbnb %>% group_by(zipcode) %>%
summarize(median = median(na.omit(price))) %>%
transmute(region = zipcode, value = median)
medians$region <- as.character(medians$region)
medians <- na.omit(medians)
medians <- subset(medians, region != 94106 & region != 94113 & region != 94965 &
region != 94510 & region != 94014)
zip_choropleth(medians, county_zoom = 6075, num_colors = 6,
title = "Median Price by Zipcode") +
scale_fill_brewer(palette = "GnBu", na.value = "white",
name = "Median Price",
labels = c("89-125","125-150","150-156","156-178","178-180","180-500","No Data"))
## Scale for 'fill' is already present. Adding another scale for 'fill',
## which will replace the existing scale.
The dark, pricey areas cluster around downtown, which illustrates how location affects the median price.